Overlay databank unlocks data-driven analyses of biomolecules for all

Tools based on artificial intelligence (AI) are currently revolutionising many fields, yet their applications are often limited by the lack of suitable training data in a programmatically accessible format. Here we propose an effective solution to make data scattered in various locations and formats accessible for data-driven and machine learning applications using the overlay databank format. To demonstrate the practical relevance of such an approach, we present the NMRlipids Databank, a community-driven, open-for-all database featuring programmatic access to quality-evaluated atom-resolution molecular dynamics (MD) simulations of cellular membranes. Cellular membrane lipid composition is implicated in diseases and controls major biological functions, but membranes are difficult to study experimentally due to their intrinsic disorder and complex phase behaviour. While MD simulations have been useful in understanding membrane systems, they require significant computational resources and often suffer from inaccuracies in model parameters. Here, we demonstrate how a programmable interface for flexible implementation of data-driven and machine learning applications, and rapid access to simulation data through a graphical user interface, unlock possibilities beyond current MD simulation and experimental studies to understand cellular membranes. The proposed overlay databank concept can be further applied to other biomolecules, as well as in other fields where similar barriers hinder the AI revolution.

Supplementary Figure 1. Structure of the NMRlipids Databank. Manually added input data (blue boxes) include basic information on the simulation, permanent links to the raw data, and experimental data if available. The databank entries (red box) and analysis results (green boxes), at https://github.com/NMRlipids/Databank/tree/main/Data/Simulations, are automatically generated by the computer programs included in the NMRlipids Databank (yellow boxes). Because the raw data are not permanently stored but can be accessed based on the information in the Databank, this connection is marked with a dashed line.
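The directory layout described in this caption can be explored programmatically. The following is a minimal sketch, assuming a local clone of the Databank repository and assuming that each entry is stored as a `README.yaml` file somewhere below `Data/Simulations`; the function name `find_entries` and the clone location are hypothetical and not part of the Databank code.

```python
from pathlib import Path


def find_entries(databank_root):
    """Return sorted paths of README.yaml files below Data/Simulations.

    databank_root is the path to a local clone of the Databank repository
    (an assumption of this sketch; the function is not part of the Databank).
    """
    simulations_dir = Path(databank_root) / "Data" / "Simulations"
    if not simulations_dir.is_dir():
        # No local clone, or an unexpected layout: report no entries.
        return []
    # rglob walks the directory tree recursively.
    return sorted(simulations_dir.rglob("README.yaml"))


# Example usage (the clone location "Databank" is hypothetical):
# for entry in find_entries("Databank"):
#     print(entry)
```

Only the file paths are collected here; parsing the entries would require a YAML reader, which is left out to keep the sketch dependency-free.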

Supplementary Figure 3. Scatter plots and Pearson correlation coefficients, r, for the membrane area per lipid with X-ray scattering form factor minima (A and B), and for the membrane thickness with the average order parameter of the sn-1 acyl chain (C) and with the second minimum of the X-ray scattering form factor (D), extracted from the NMRlipids Databank. All correlation coefficients have p-values below 0.001.
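For reference, the Pearson correlation coefficient reported in these panels can be computed as in the sketch below. It uses only the Python standard library and illustrative inputs; it is not the analysis code behind the figure, and it does not compute the p-value (a statistics package such as SciPy, via `scipy.stats.pearsonr`, would be one way to obtain it).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)
```

In the context of the figure, `x` would hold, for example, the areas per lipid and `y` the corresponding form factor minima, one pair per simulation.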

Finding the best models for PC and PE mixtures

Supplementary Figure 5. Top 50 simulations in the NMRlipids Databank ranked based on the C-H bond order parameter quality against experiments. The columns 2-4 show qualities for acyl chain order parameters (P_tails), headgroup order parameters (P_hg), all order parameters (P_total), and for X-ray scattering form factors (FF_q). Column 5 shows relative equilibration times for conformations (τ_rel). Note that the best possible order parameter quality is one, while the best possible form factor quality is zero. ID values in the last column can be used to identify each simulation in the databank.

For the mean value in each bin, average weighted with the simulation lengths was used, and error bars show the standard error of the mean. Only bins with more than one microsecond of data in total were used for water permeation. Only simulations with the temperatures between 300-315 K were used in D.
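The ranking logic described in the Supplementary Figure 5 caption can be illustrated with a short sketch. The record layout (dictionaries with the keys `ID`, `P_total`, and `FF_q`) is a hypothetical stand-in for however the quality values are actually stored; only the sort directions follow the caption, where the best order parameter quality is one and the best form factor quality is zero.

```python
def rank_by_order_parameter_quality(simulations, top_n=50):
    """Rank simulations by total order parameter quality, best (highest) first.

    simulations is a list of dicts with hypothetical keys "ID" and "P_total";
    entries without a quality value are skipped.
    """
    scored = [s for s in simulations if s.get("P_total") is not None]
    return sorted(scored, key=lambda s: s["P_total"], reverse=True)[:top_n]


def rank_by_form_factor_quality(simulations, top_n=50):
    """Rank simulations by form factor quality, best (lowest) first."""
    scored = [s for s in simulations if s.get("FF_q") is not None]
    return sorted(scored, key=lambda s: s["FF_q"])[:top_n]
```

The two functions sort in opposite directions because the two quality measures have opposite scales.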

NMR experiments
Acyl chain order parameters of POPE (Supplementary Figures 10 and 11) and POPG (Supplementary Figures 12 and 13) were analyzed from the same data that were previously recorded to determine headgroup order parameters 13. The analysis of the crowded spectral region at 29-31 ppm was based on the previous assignment reported for POPC membranes 35. To measure the order parameters for DOPC (Supplementary Figure 14), the sample was prepared and experiments performed similarly to previous studies 13.

Supplementary Figure 10. Determination of the POPE acyl chain order parameters from an R-PDLF spectrum measured at a magic angle spinning frequency of 5.15 kHz. (A) ¹³C rINEPT spectrum with peak assignment. The labels used are shown in the chemical structure of POPE. The chemical shift of the methyl groups was defined as 13.8 ppm. (B) Contour plot of the R-PDLF spectrum for the crowded spectral region. The assignment was based on a previous assignment reported for POPC membranes 35. (C) C-H bond order parameter profile for the acyl chains of POPE. The splittings used for calculating the order parameters are shown in Supplementary Figure 11. The unassigned peaks belong to the headgroup and glycerol backbone carbons. A detailed assignment and order parameter analysis of these carbons was shown previously 13.

Supplementary Figure 11. Dipolar spectra obtained from the 2D R-PDLF spectrum from POPE in Supplementary Figure 10. The number at the top left corner of each panel denotes the corresponding chemical shift. The carbon label for each splitting is displayed on the top right corner. The labels are the same as in Supplementary Figure 10.

Supplementary Figure 12. Determination of the POPG acyl chain order parameters from an R-PDLF spectrum measured at a magic angle spinning frequency of 5.15 kHz. (A) ¹³C rINEPT spectrum with peak assignment. The labels used are shown in the chemical structure of POPG. The chemical shift of the methyl groups was defined as 13.8 ppm. (B) Contour plot of the R-PDLF spectrum for the crowded spectral region. The assignment was based on a previous assignment reported for POPC membranes 35. (C) C-H bond order parameter profile for the acyl chains of POPG. The splittings used for calculating the order parameters are shown in Supplementary Figure 13. The unassigned peaks belong to the headgroup and glycerol backbone carbons. A detailed assignment and order parameter analysis of these carbons was shown previously 13.

Supplementary Figure 13. Dipolar spectra obtained from the 2D R-PDLF spectrum described in Supplementary Figure 12.

The number at the top left corner of each panel denotes the corresponding chemical shift. The carbon label for each splitting is displayed on the top right corner. The labels are the same as in Supplementary Figure 12.

Supplementary Figure 15. Dipolar spectra obtained from the 2D R-PDLF spectrum described in Supplementary Figure 14. The number at the top left corner of each panel denotes the corresponding chemical shift. The carbon label for each splitting is displayed on the top right corner. The labels are the same as in Supplementary Figure 14.

Supplementary Figure 4. Dependence of the form factor F(q_z), the electron density profiles along the membrane normal, and the C-H bond order parameters S_CH (from top to bottom) on the simulation box size (with different columns showing different cholesterol concentrations). Simulations with 64, 256, and 1024 POPC lipids are from Ref. 30.

Supplementary Figure 6. Simulations with the data for both POPC (top) and POPE (bottom) directly compared with the experimental data. The area per lipid increases from left to right. Simulations with the best overall quality for POPC and POPE order parameters are highlighted with a solid border.

Supplementary Figure 8. Water permeation through membranes analyzed from the Databank as a function of (A) hydration level, (B) fraction of cholesterol, (C) fraction of charged lipids, and (D) fraction of POPE in the membrane. Values from simulations with non-zero permeation values are shown with blue dots. Histogrammed values are shown with black dots. For the mean value in each bin, the average weighted with the simulation lengths was used, and error bars show the standard error of the mean. Only bins with more than one microsecond of data were used. Only simulations with temperatures between 300 and 315 K were used.

Supplementary Figure 9. Lateral diffusion of water as a function of (A) area per lipid, (B) temperature, (C) membrane thickness, and (D) fraction of charged lipids in a membrane. Non-zero permeation and diffusion values from simulations are shown with blue dots. Histogrammed values are shown with black dots.
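The binning described for the water permeation data (length-weighted bin means, bins kept only above one microsecond of total data, temperatures restricted to 300-315 K) can be sketched as follows. The record layout and the function name are hypothetical, and the captions do not specify how the standard error of the mean was computed; the unweighted sample formula used here is an assumption of this sketch.

```python
import math


def binned_weighted_mean(records, bin_edges, min_total_length_us=1.0,
                         t_range=(300.0, 315.0)):
    """Length-weighted mean per bin.

    records: iterable of (x, value, length_us, temperature_K) tuples
    (a hypothetical layout). Returns a list of (bin_center, mean, sem).
    """
    n_bins = len(bin_edges) - 1
    bins = [[] for _ in range(n_bins)]
    for x, value, length_us, temperature in records:
        if not (t_range[0] <= temperature <= t_range[1]):
            continue  # outside the temperature window
        for i in range(n_bins):
            if bin_edges[i] <= x < bin_edges[i + 1]:
                bins[i].append((value, length_us))
                break
    results = []
    for i, members in enumerate(bins):
        total_length = sum(length for _, length in members)
        if total_length <= min_total_length_us:
            continue  # keep only bins with more than the minimum of data
        mean = sum(v * length for v, length in members) / total_length
        n = len(members)
        if n > 1:
            # Assumption: unweighted sample standard error of the mean.
            plain_mean = sum(v for v, _ in members) / n
            variance = sum((v - plain_mean) ** 2 for v, _ in members) / (n - 1)
            sem = math.sqrt(variance / n)
        else:
            sem = None  # undefined for a single simulation
        center = 0.5 * (bin_edges[i] + bin_edges[i + 1])
        results.append((center, mean, sem))
    return results
```

Weighting by simulation length means that a long trajectory contributes more to the bin mean than a short one, which matches the weighting stated in the caption.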

Supplementary Figure 14. Determination of DOPC order parameters from an R-PDLF spectrum measured at a magic angle spinning frequency of 5.15 kHz. (A) ¹³C rINEPT spectrum with peak assignment. The labels used are shown in the chemical structure of DOPC. The chemical shift of the methyl groups was defined as 13.8 ppm. (B) Contour plot of the R-PDLF spectrum for the crowded spectral region. The assignment was based on a previous assignment reported for POPC membranes 35. (C) C-H bond order parameters of DOPC.

List of current force fields used in simulations in the Databank, with references.

Table 2. Examples of codes that analyze membrane properties from the Databank in an Application layer available at https://github.com/NMRLipids/DataBankManuscript/.

Table 3. Keys stored in the README.yaml files of simulations.

Table 4. Keys stored in the README.yaml files of experiments.

Table 5. List of relevant codes used to build the Databank and perform analyses in the Databank layer available at https://github.com/NMRLipids/Databank/.

Supplementary Figure 2. Flowchart for accessing results calculated from the NMRlipids Databank and stored to the Application layer. [Flowchart box labels recovered from the figure: "molecule and atoms names"; "Clone the Databank and Application layer".]